An Improved Algorithm of Logical Structure Reconstruction for Re-flowable Document Understanding

نویسندگان

  • Lin Zhao
  • Ning Li
  • Xin Peng
  • Qi Liang
چکیده

The basic idea of re-flowable document understanding and automatic typesetting is to generate logical documents by judging the hierarchical relationship of physical units and logical tags based on the identification of logical paragraph tags in re-flowable document. In order to overcome the shortages of conventional logical structure reconstruction methods, a novel logical structure reconstruction method of re-flowable document based on directed graph is proposed in this paper. This method extracts the logical structure from the template document and then utilizes directed graph's single-source shortest path algorithm to filter out redundant logical tags, thus solving the problem of logical structure reconstruction of a document. Experimental results show that the algorithm can effectively improve the accuracy of logical structure recognition.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

روش جدید متن‌کاوی برای استخراج اطلاعات زمینه کاربر به‌منظور بهبود رتبه‌بندی نتایج موتور جستجو

Today, the importance of text processing and its usages is well known among researchers and students. The amount of textual, documental materials increase day by day. So we need useful ways to save them and retrieve information from these materials. For example, search engines such as Google, Yahoo, Bing and etc. need to read so many web documents and retrieve the most similar ones to the user ...

متن کامل

An Improved Algorithm for Network Reliability Evaluation

Binary Decision Diagram (BDD) is a data structure proved to be compact in representation and efficient in manipulation of Boolean formulas. Using Binary decision diagram in network reliability analysis has already been investigated by some researchers. In this paper we show how an exact algorithm for network reliability can be improved and implemented efficiently by using CUDD - Colorado Univer...

متن کامل

Stochastic attribute grammar model of document production and its use in document image decoding

Document Image Decoding (DID) refers to the process of document recognition within a communication theory framework. In this framework, a logical document structure is a message communicated by encoding the structure as an ideal image, transmitting the ideal image through a noisy channel, and decoding the degraded image into a logical structure as close to the original message as possible, on a...

متن کامل

An improved genetic algorithm for multidimensional optimization of precedence-constrained production planning and scheduling

Integration of production planning and scheduling is a class of problems commonly found in manufacturing industry. This class of problems associated with precedence constraint has been previously modeled and optimized by the authors, in which, it requires a multidimensional optimization at the same time: what to make, how many to make, where to make and the order to make. It is a combinatorial,...

متن کامل

Machine Learning for Reading Order Detection in Document Image Understanding

Document image understanding refers to logical and semantic analysis of document images in order to extract information understandable to humans and codify it into machine-readable form. Most of the studies on document image understanding have targeted the specific problem of associating layout components with logical labels, while less attention has been paid to the problem of extracting relat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015